Bioinformatics An Introduction 4th Edition (Jeremy Ramsden)

102

Probability and Likelihood

9.2.1

Generalized Union

The event that at least one ofupper NN eventsupper A 1 comma upper A 2 comma ellipsis comma upper A Subscript upper N BaselineA1, A2, . . . , AN occurs (i.e.,upper A equals upper A 1 union upper A 2 union midline horizontal ellipsis union upper A Subscript upper NA = A1 ∪A2 ∪

· · · ∪AN) needs information not only about the individual events but also about all

possible overlaps.

Theorem. The probability upper P 1P1 of the realization of at least one among the events

upper A 1 comma upper A 2 comma ellipsis comma upper A Subscript upper N BaselineA1, A2, . . . , AN is given by

upper P 1 equals upper S 1 minus upper S 2 plus upper S 3 minus upper S 4 plus midline horizontal ellipsis plus or minus upper S Subscript upper N Baseline commaP1 = S1 −S2 + S3 −S4 + · · · ± SN ,

(9.8)

where theupper S Subscript rSr are deﬁned as the sums of all probabilities withrr subscripts (e.g.,upper S 1 equals sigma summation p Subscript i Baseline comma upper S 2 equals sigma summation p Subscript i j BaselineS1 =

E pi , S2 = E pi j, and i less than j less than k less than midline horizontal ellipsis less than or equals upper Ni < j < k < · · · ≤N) so that each contribution appears

only once; hence, each sumupper S Subscript rSr hasStartBinomialOrMatrix upper N Choose r EndBinomialOrMatrix

)

terms, and the last termupper S Subscript upper NSN gives the probability

of the simultaneous realization of all terms. ⁶

This result can be used to solve an old problem. Consider two sequences of upper NN

unique symbols differing only in the order of occurrence of the symbols and which

are then compared, symbol by symbol. What is the probabilityupper P 1P1 that there is at least

one match? Let upper A Subscript kAk be the event that a match occurs at the kkth position. Therefore,

symbol numberkk is at thekkth place, and the remainingupper N minus 1N −1 are anywhere; hence,

StartLayout 1st Row p Subscript k Baseline equals StartFraction left parenthesis upper N minus 1 right parenthesis factorial Over upper N factorial EndFraction equals StartFraction 1 Over upper N EndFraction comma EndLayoutpk = ⁽^N⁻¹⁾^!

= ¹

N ^,

and for every combination i comma ji, j,

StartLayout 1st Row p Subscript i j Baseline equals StartFraction left parenthesis upper N minus 2 right parenthesis factorial Over upper N factorial EndFraction equals StartFraction 1 Over upper N left parenthesis upper N minus 1 right parenthesis EndFraction period EndLayoutpi j = ⁽^N⁻²⁾^!

N(N −1) ^.

Each term in upper S Subscript rSr in Eq. (9.8) equals left parenthesis upper N minus r right parenthesis factorial slash upper N factorial(N −r)!/N! and therefore 1 divided by r factorial1/r!; therefore,

upper P 1 equals 1 minus StartFraction 1 Over 2 factorial EndFraction plus StartFraction 1 Over 3 factorial EndFraction minus plus midline horizontal ellipsis plus or minus StartFraction 1 Over upper N factorial EndFraction periodP1 = 1 −¹

2! ⁺¹

3! ^{−+ · · · ±}¹

N! ^.

(9.9)

One might recognize that1 minus upper P 11 −P1 represents the ﬁrstupper N plus 1N + 1 terms in the expansion of

1 divided by e1/e; hence, upper P 1 almost equals 1 minus 1 divided by e almost equals 0.632P1 ≈1 −1/e ≈0.632. It seems rather remarkable that upper P 1P1 is indepen-

dent of upper NN. For problems of matching genes and the like, it is useful to consider an

extension, that for any integer1 less than or equals m less than or equals upper N1 ≤m ≤N the probabilityupper P Subscript left bracket m right bracketP[m] that exactlymm among

the upper NN events upper A 1 commaA1,…comma upper A Subscript upper N Baseline, AN occur simultaneously isSuperscript 6⁶

upper P Subscript left bracket m right bracket Baseline equals upper S Subscript m Baseline minus StartBinomialOrMatrix m plus 1 Choose m EndBinomialOrMatrix upper S Subscript m plus 1 Baseline plus StartBinomialOrMatrix m plus 2 Choose m EndBinomialOrMatrix upper S Subscript m plus 2 Baseline minus plus midline horizontal ellipsis plus or minus StartBinomialOrMatrix upper N Choose m EndBinomialOrMatrix upper S Subscript upper NP[m] = Sm −

(m + 1

)

Sm+1 +

(m + 2

)

Sm+2 −+ · · · ±

)

(9.10)

and

6 The proof is given in Feller (1967), Chap. 4.